Rex generated scanners automatically provide the line and column position of each token. For languages like Pascal and Ada where the case of letters is insignificant tokens can be normalized to lower or upper case. There are predefined rules to skip white space like blanks, tabs, or newlines and there is a mechanism to handle include files. The generated scanners are table-driven deterministic finite automatons.
if output is in C: <Scanner>.h specification of the generated scanner <Scanner>.c body of the generated scanner <Scanner>Source.h specification of support module source <Scanner>Source.c body of support module source <Scanner>Drv.c main program to serve as test driver if output is in Modula-2: <Scanner>.md definition module of the generated scanner <Scanner>.mi implementation module of the generated scanner <Scanner>Source.md definition module of support module source <Scanner>Source.mi implementation module of support module source <Scanner>Drv.mi main program to serve as test driver <Scanner>.Tab tables to control the generated scanner
J. Grosch: "Efficient Generation of Lexical Analysers", Software - Practice & Experience, 19 (11), 1089-1103, Nov. 1989 Rex is implemented by a 5,000 line Modula-2 program. The program makes heavy use of a library of reusable Modula-2 modules currently comprising 3,000 lines of code [Gro87]. Of the 5,000 lines of Rex 1,500 lines are generated by tools: 500 lines for the scanner are generated by Rex itself. 1000 lines for the parser are generated by the LALR(1) parser generator How can Rex generate a part of itself before its existence? Well, the scanner has been bootstrapped using LEX. The first version of the scanner was a separate C program generated by LEX which wrote the internal representation of the tokens on a file. A simple hand written scanner read the tokens from this file during construction of Rex. After Rex was operational it could generate its own scanner in Modula-2. And how is Rex working? It differentiates between constant regular expressions and non-constant ones as defined in [Gro89]. The non-constant regular expressions constitute a nondeterministic finite automaton. The so-called subset construction algorithm is used for conversion into a deterministic finite automaton. Then an algorithm to minimize the number of states is applied. After extending the automaton to a tunnel automaton the constant regular expressions are added in linear time using the algorithm described in [Gro89]. The sparse matrix to control the automaton is compressed into a data structure called "comb vector" [ASU86] to save space. The key to the performance of scanners generated by Rex lies in the following facts: access to the "comb vector" table is fast input happens rarely because blocks of characters are transferred no check for the last character of a block is necessary because of the sentinel technique used the same holds for the check of stack underflow for the stack to record the passed states the treatment of right context is efficient and only necessary in a few cases because partial evaluation has been applied Some specialists might want to know about the differences between Rex and LEX [Les75] (see Table):
Meaning | LEX | Rex |
delimiter for character classes | [ ] | { } |
complement of character classes | [^ ] | - { } |
any character | . | ANY |
left justification | ^ | < |
right justification | $ | > |
replicator | {n} | [n] |
replicator | {m,n} | [m-n] |
delimiter for start states | < > | # # |
escape representation for characters | \\\\\\\\octal | \\\\\\\\decimal |
scanner routine | yylex | <Scanner>_GetToken |
access to matched string | yytext | <Scanner>_GetWord ( ) |
length of matched string | yyleng | result of <Scanner>_GetWord ( ) |
output of matched string | ECHO | yyEcho |
retain part of matched string | yyless | yyLess |
initial start state | INITIAL | STD |
change of start state | BEGIN | yyStart ( ) |
specification : [name] [code] [define] [start] rules . name : SCANNER [Ident] . code : | code EXPORT TargetCode | code GLOBAL TargetCode | code LOCAL TargetCode | code BEGIN TargetCode | code CLOSE TargetCode | code DEFAULT TargetCode | code EOF TargetCode . define : DEFINE definition * . start : START identList . rules : RULE rule * | RULES rule * . definition : Ident '=' regExpr '.' . identList : Ident | identList Ident | identList ',' Ident . rule : patternList ':' TargetCode | patternList ':-' TargetCode . patternList : pattern | patternList ',' pattern . pattern : [startStates] ['<'] regExpr ['/' regExpr] ['>'] . startStates : '#' identList '#' | NOT '#' identList '#' . regExpr : regExpr '|' regExpr | regExpr regExpr | regExpr '+' | regExpr '*' | regExpr '?' | regExpr '[' Number ']' | regExpr '[' Number '-' Number ']' | '(' regExpr ')' | charSet | Char | Ident | String | Number . charSet : '-' charSet | '{' range * '}' . range : Char | Char '-' Char . Char : character | '\' digit + | '\' n | '\' t | '\' v | '\' b | '\' r | '\' f | '\' character . Ident : letter letter_or_digit * . letter_or_digit : letter | digit | '_' . String : '"' character * '"' . Number : digit + . Target_code : '{' character * '}' . GLOBAL { # include "Memory.h" # include "StringMem.h" # include "Idents.h" int level = 0; void ErrorAttribute (Token, Attribute) int Token; tScanAttribute Attribute; { } } LOCAL { char Word [256]; tIdent ident ; tStringRef ref ; int length ; } DEFAULT { printf ("illegal character: "); yyEcho; printf ("\n"); } DEFINE digit = {0-9} . letter = {a-z A-Z} . cmt = - {*(\t\n} . START comment RULE "(*" :- {++ level; yyStart (comment);} #comment# "*)" :- {-- level; if (level == 0) yyStart (STD);} #comment# "(" | "*" | cmt + :- {} /* The procedure PutString is imported from the module StringMem(ory). It is used to store the string representation of some tokens. */ #STD# digit + , #STD# digit + / ".." : {length = GetWord (Word); ref = PutString (Word, length); return 1;} #STD# {0-7} + B : {length = GetWord (Word); ref = PutString (Word, length); return 2;} #STD# {0-7} + C : {length = GetWord (Word); ref = PutString (Word, length); return 3;} #STD# digit {0-9 A-F} * H : { length = GetWord (Word); ref = PutString (Word, length); return 4;} #STD# digit + "." digit * (E {+\-} ? digit +) ? : { length = GetWord (Word); ref = PutString (Word, length); return 5;} #STD# ' - {\n'} * ' | \" - {\n"} * \" : {length = GetWord (Word); ref = PutString (Word, length); return 6;} #STD# "#" : {return 7;} #STD# "&" : {return 8;} #STD# "(" : {return 9;} #STD# ")" : {return 10;} #STD# "*" : {return 11;} #STD# "+" : {return 12;} #STD# "," : {return 13;} #STD# "-" : {return 14;} #STD# "." : {return 15;} #STD# ".." : {return 16;} #STD# "/" : {return 17;} #STD# ":" : {return 18;} #STD# ":=" : {return 19;} #STD# ";" : {return 20;} #STD# "<" : {return 21;} #STD# "<=" : {return 22;} #STD# "<>" : {return 23;} #STD# "=" : {return 24;} #STD# ">" : {return 25;} #STD# ">=" : {return 26;} #STD# "[" : {return 27;} #STD# "]" : {return 28;} #STD# "^" : {return 29;} #STD# "{" : {return 30;} #STD# "|" : {return 31;} #STD# "}" : {return 32;} #STD# "~" : {return 33;} #STD# AND : {return 34;} #STD# ARRAY : {return 35;} #STD# BEGIN : {return 36;} #STD# BY : {return 37;} #STD# CASE : {return 38;} #STD# CONST : {return 39;} #STD# DEFINITION : {return 40;} #STD# DIV : {return 41;} #STD# DO : {return 42;} #STD# ELSE : {return 43;} #STD# ELSIF : {return 44;} #STD# END : {return 45;} #STD# EXIT : {return 46;} #STD# EXPORT : {return 47;} #STD# FOR : {return 48;} #STD# FROM : {return 49;} #STD# IF : {return 50;} #STD# IMPLEMENTATION : {return 51;} #STD# IMPORT : {return 52;} #STD# IN : {return 53;} #STD# LOOP : {return 54;} #STD# MOD : {return 55;} #STD# MODULE : {return 56;} #STD# \NOT : {return 57;} #STD# OF : {return 58;} #STD# OR : {return 59;} #STD# POINTER : {return 60;} #STD# PROCEDURE : {return 61;} #STD# QUALIFIED : {return 62;} #STD# RECORD : {return 63;} #STD# REPEAT : {return 64;} #STD# RETURN : {return 65;} #STD# SET : {return 66;} #STD# THEN : {return 67;} #STD# TO : {return 68;} #STD# TYPE : {return 69;} #STD# UNTIL : {return 70;} #STD# VAR : {return 71;} #STD# WHILE : {return 72;} #STD# WITH : {return 73;} #STD# letter (letter | digit) * : { ident = MakeIdent (TokenPtr, TokenLength); return 74;} GLOBAL { FROM Strings IMPORT tString ; FROM StringMem IMPORT tStringRef , PutString ; FROM Idents IMPORT tIdent , MakeIdent ; VAR level : CARDINAL; PROCEDURE ErrorAttribute (Token: INTEGER; VAR Attribute: tScanAttribute); BEGIN END ErrorAttribute; } LOCAL { VAR Word : tString; ident : tIdent; ref : tStringRef; } BEGIN { level := 0; } DEFAULT { IO.WriteS (IO.StdOutput, "illegal character: "); yyEcho; IO.WriteNl (IO.StdOutput); } DEFINE digit = {0-9} . letter = {a-z A-Z} . cmt = - {*(\t\n} . START comment RULE "(*" :- {INC (level); yyStart (comment);} #comment# "*)" :- {DEC (level); IF level = 0 THEN yyStart (STD); END;} #comment# "(" | "*" | cmt + :- {} #STD# digit + , #STD# digit + / ".." : {GetWord (Word); ref := PutString (Word); RETURN 1;} #STD# {0-7} + B : {GetWord (Word); ref := PutString (Word); RETURN 2;} #STD# {0-7} + C : {GetWord (Word); ref := PutString (Word); RETURN 3;} #STD# digit {0-9 A-F} * H : { GetWord (Word); ref := PutString (Word); RETURN 4;} #STD# digit + "." digit * (E {+\-} ? digit +) ? : { GetWord (Word); ref := PutString (Word); RETURN 5;} #STD# ' - {\n'} * ' | \" - {\n"} * \" : {GetWord (Word); ref := PutString (Word); RETURN 6;} #STD# "#" : {RETURN 7;} #STD# "&" : {RETURN 8;} #STD# "(" : {RETURN 9;} #STD# ")" : {RETURN 10;} #STD# "*" : {RETURN 11;} #STD# "+" : {RETURN 12;} #STD# "," : {RETURN 13;} #STD# "-" : {RETURN 14;} #STD# "." : {RETURN 15;} #STD# ".." : {RETURN 16;} #STD# "/" : {RETURN 17;} #STD# ":" : {RETURN 18;} #STD# ":=" : {RETURN 19;} #STD# ";" : {RETURN 20;} #STD# "<" : {RETURN 21;} #STD# "<=" : {RETURN 22;} #STD# "<>" : {RETURN 23;} #STD# "=" : {RETURN 24;} #STD# ">" : {RETURN 25;} #STD# ">=" : {RETURN 26;} #STD# "[" : {RETURN 27;} #STD# "]" : {RETURN 28;} #STD# "^" : {RETURN 29;} #STD# "{" : {RETURN 30;} #STD# "|" : {RETURN 31;} #STD# "}" : {RETURN 32;} #STD# "~" : {RETURN 33;} #STD# AND : {RETURN 34;} #STD# ARRAY : {RETURN 35;} #STD# BEGIN : {RETURN 36;} #STD# BY : {RETURN 37;} #STD# CASE : {RETURN 38;} #STD# CONST : {RETURN 39;} #STD# DEFINITION : {RETURN 40;} #STD# DIV : {RETURN 41;} #STD# DO : {RETURN 42;} #STD# ELSE : {RETURN 43;} #STD# ELSIF : {RETURN 44;} #STD# END : {RETURN 45;} #STD# EXIT : {RETURN 46;} #STD# EXPORT : {RETURN 47;} #STD# FOR : {RETURN 48;} #STD# FROM : {RETURN 49;} #STD# IF : {RETURN 50;} #STD# IMPLEMENTATION : {RETURN 51;} #STD# IMPORT : {RETURN 52;} #STD# IN : {RETURN 53;} #STD# LOOP : {RETURN 54;} #STD# MOD : {RETURN 55;} #STD# MODULE : {RETURN 56;} #STD# \NOT : {RETURN 57;} #STD# OF : {RETURN 58;} #STD# OR : {RETURN 59;} #STD# POINTER : {RETURN 60;} #STD# PROCEDURE : {RETURN 61;} #STD# QUALIFIED : {RETURN 62;} #STD# RECORD : {RETURN 63;} #STD# REPEAT : {RETURN 64;} #STD# RETURN : {RETURN 65;} #STD# SET : {RETURN 66;} #STD# THEN : {RETURN 67;} #STD# TO : {RETURN 68;} #STD# TYPE : {RETURN 69;} #STD# UNTIL : {RETURN 70;} #STD# VAR : {RETURN 71;} #STD# WHILE : {RETURN 72;} #STD# WITH : {RETURN 73;} #STD# letter (letter | digit) * : { GetWord (Word); ident := MakeIdent (Word); RETURN 74;} EXPORT { FROM Idents IMPORT tIdent ; FROM StringMem IMPORT tStringRef; FROM Texts IMPORT tText ; FROM Positions IMPORT tPosition; TYPE tScanAttribute = RECORD Position : tPosition ; CASE : INTEGER OF | 1: Ident : tIdent ; | 2: Number : SHORTCARD ; | 3: String : tStringRef ; | 4: Ch : CHAR ; | 5: Text : tText ; END; END; PROCEDURE ErrorAttribute (Token: INTEGER; VAR Attribute: tScanAttribute); } GLOBAL { FROM SYSTEM IMPORT ADDRESS; FROM Strings IMPORT tString, Concatenate, Char, SubString, StringToInt, AssignEmpty, Length; FROM Texts IMPORT MakeText, Append; FROM StringMem IMPORT tStringRef, PutString; FROM Idents IMPORT tIdent, MakeIdent, NoIdent; FROM Errors IMPORT ErrorMessage, Error; FROM ScanGen IMPORT Language, tLanguage; FROM Positions IMPORT tPosition; CONST SymIdent = 1 ; SymNumber = 2 ; SymString = 3 ; SymChar = 4 ; SymTargetcode = 5 ; SymScanner = 37 ; SymExport = 32 ; SymGlobal = 6 ; SymLocal = 31 ; SymBegin = 7 ; SymClose = 8 ; SymEof = 34 ; SymDefault = 36 ; SymDefine = 9 ; SymStart = 10 ; SymRules = 11 ; SymNot = 30 ; SymDot = 12 ; SymComma = 13 ; SymEqual = 14 ; SymColon = 15 ; SymColonMinus = 35 ; SymNrSign = 33 ; SymSlash = 16 ; SymBar = 17 ; SymPlus = 18 ; SymMinus = 19 ; SymAsterisk = 20 ; SymQuestion = 21 ; SymLParen = 22 ; SymRParen = 23 ; SymLBracket = 24 ; SymRBracket = 25 ; SymLBrace = 26 ; SymRBrace = 27 ; SymLess = 28 ; SymGreater = 29 ; BraceMissing = 13 ; UnclosedComment = 14 ; UnclosedString = 16 ; VAR level : INTEGER ; string : tString ; NoString : tStringRef ; Position : tPosition ; PROCEDURE ErrorAttribute (Token: INTEGER; VAR Attribute: tScanAttribute); BEGIN CASE Token OF | SymIdent : Attribute.Ident := NoIdent; | SymNumber : Attribute.Number := 0; | SymString : Attribute.String := NoString; | SymChar : Attribute.Ch := '?'; | SymTargetcode : MakeText (Attribute.Text); ELSE END; END ErrorAttribute; } LOCAL { VAR TargetCode, String, Word: tString; PrevState: SHORTCARD; } BEGIN { level := 0; AssignEmpty (string); NoString := PutString (string); } EOF { CASE yyStartState OF | targetcode , set : ErrorMessage (BraceMissing , Error, Attribute.Position); | comment : ErrorMessage (UnclosedComment , Error, Attribute.Position); | CStr1, CStr2, Str1, Str2 : ErrorMessage (UnclosedString , Error, Attribute.Position); ELSE END; yyStart (STD); } DEFINE letter = {A-Z a-z} . digit = {0-9} . string = - {"\n} . cmtch = - {*\t\n} . code = - {{\}\t\n\\'"} . StrCh1 = - {'\t\n} . StrCh2 = - {"\t\n} . CStrCh1 = - {'\t\n\\} . CStrCh2 = - {"\t\n\\} . START targetcode, set, rules, comment, Str1, Str2, CStr1, CStr2 RULES #targetcode# "{" : { IF level = 0 THEN MakeText (Attribute.Text); AssignEmpty (TargetCode); Position := Attribute.Position; ELSE GetWord (Word); Concatenate (TargetCode, Word); END; INC (level); } #targetcode# "}" :- { DEC (level); IF level = 0 THEN yyStart (PrevState); Append (Attribute.Text, TargetCode); Attribute.Position := Position; RETURN SymTargetcode; ELSE GetWord (Word); Concatenate (TargetCode, Word); END; } #targetcode# code + :- { IF level > 0 THEN GetWord (Word); Concatenate (TargetCode, Word); END; } #targetcode# \t :- { IF level > 0 THEN Strings.Append (TargetCode, 11C); END; yyTab; } #targetcode# \n :- { IF level > 0 THEN Append (Attribute.Text, TargetCode); AssignEmpty (TargetCode); END; yyEol (0); } #targetcode# \\ ANY :- { IF level > 0 THEN GetWord (Word); Strings.Append (TargetCode, Char (Word, 2)); END; } #targetcode# \\ :- { IF level > 0 THEN Strings.Append (TargetCode, '\'); END; } #targetcode# ' : { GetWord (String); IF Language = C THEN yyStart (CStr1); ELSE yyStart (Str1); END; } #targetcode# \" : { GetWord (String); IF Language = C THEN yyStart (CStr2); ELSE yyStart (Str2); END; } #Str1# StrCh1 + , #Str2# StrCh2 + , #CStr1# CStrCh1 + | \\ ANY ? , #CStr2# CStrCh2 + | \\ ANY ? :- {GetWord (Word); Concatenate (String, Word);} #CStr1# \\ \n , #CStr2# \\ \n :- {GetWord (Word); Concatenate (String, Word); yyEol (0);} #Str1, CStr1# ' , #Str2, CStr2# \" :- {Strings.Append (String, Char (String, 1)); yyPrevious; Concatenate (TargetCode, String); } #Str1, Str2, CStr1, CStr2# \t :- {Strings.Append (String, 11C); yyTab;} #Str1, Str2, CStr1, CStr2# \n :- { ErrorMessage (UnclosedString, Error, Attribute.Position); Strings.Append (String, Char (String, 1)); yyEol (0); yyPrevious; Concatenate (TargetCode, String); } #STD, rules# "/*" :- {yyStart (comment) ;} #comment# "*" | cmtch + :- {} #comment# "*/" :- {yyPrevious ;} #STD# EXPORT : {PrevState := STD; yyStart (targetcode); RETURN SymExport ;} #STD# GLOBAL : {PrevState := STD; yyStart (targetcode); RETURN SymGlobal ;} #STD# LOCAL : {PrevState := STD; yyStart (targetcode); RETURN SymLocal ;} #STD# BEGIN : {PrevState := STD; yyStart (targetcode); RETURN SymBegin ;} #STD# CLOSE : {PrevState := STD; yyStart (targetcode); RETURN SymClose ;} #STD# DEFAULT : {PrevState := STD; yyStart (targetcode); RETURN SymDefault ;} #STD# EOF : {PrevState := STD; yyStart (targetcode); RETURN SymEof ;} #STD# SCANNER : {RETURN SymScanner ;} #STD# DEFINE : {RETURN SymDefine ;} #STD# START : {RETURN SymStart ;} #STD# RULE S ?: {yyStart (rules); RETURN SymRules ;} #rules# \NOT : {RETURN SymNot ;} #STD, rules# letter (letter | digit | _) * : { GetWord (Word); Attribute.Ident := MakeIdent (Word); RETURN SymIdent; } #STD, rules# digit + : { GetWord (Word); Attribute.Number := StringToInt (Word); RETURN SymNumber; } #STD, rules# \" string * \" : { GetWord (Word); SubString (Word, 2, Length (Word) - 1, TargetCode); Attribute.String := PutString (TargetCode); RETURN SymString; } #STD# "." : {RETURN SymDot ;} #STD# "=" : {RETURN SymEqual ;} #STD, set# "}" : {yyPrevious; RETURN SymRBrace ;} #STD, set, rules# "-" : {RETURN SymMinus ;} #STD, rules# "," : {RETURN SymComma ;} #STD, rules# "|" : {RETURN SymBar ;} #STD, rules# "+" : {RETURN SymPlus ;} #STD, rules# "*" : {RETURN SymAsterisk ;} #STD, rules# "?" : {RETURN SymQuestion ;} #STD, rules# "(" : {RETURN SymLParen ;} #STD, rules# ")" : {RETURN SymRParen ;} #STD, rules# "[" : {RETURN SymLBracket ;} #STD, rules# "]" : {RETURN SymRBracket ;} #STD, rules# "{" : {yyStart (set); RETURN SymLBrace ;} #rules# "#" : {RETURN SymNrSign ;} #rules# "/" : {RETURN SymSlash ;} #rules# "<" : {RETURN SymLess ;} #rules# ">" : {RETURN SymGreater ;} #rules# ":" : {PrevState := rules; yyStart (targetcode); RETURN SymColon ;} #rules# ":-" : {PrevState := rules; yyStart (targetcode); RETURN SymColonMinus ;} #STD, set, rules# \\ n : {Attribute.Ch := 012C; RETURN SymChar;} #STD, set, rules# \\ t : {Attribute.Ch := 011C; RETURN SymChar;} #STD, set, rules# \\ v : {Attribute.Ch := 013C; RETURN SymChar;} #STD, set, rules# \\ b : {Attribute.Ch := 010C; RETURN SymChar;} #STD, set, rules# \\ r : {Attribute.Ch := 015C; RETURN SymChar;} #STD, set, rules# \\ f : {Attribute.Ch := 014C; RETURN SymChar;} #STD, set, rules# \\ digit + : { GetWord (Word); SubString (Word, 2, Length (Word), TargetCode); Attribute.Ch := CHR (CARDINAL (StringToInt (TargetCode))); RETURN SymChar; } #STD, set, rules# \\ ANY : { GetWord (Word); Attribute.Ch := Char (Word, 2); RETURN SymChar; } #STD, set, rules# - {\t\n\ \f\r} : { GetWord (Word); Attribute.Ch := Char (Word, 1); RETURN SymChar; } \f :- {} \r :- {}
[ [ [ [